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Abstract 

Background: In mass spectrometry-based proteomics, the statistical significance of a peptide-spectrum or protein- 
spectrum match is an important indicator of the correctness of the peptide or protein identification. In bottom-up 
mass spectrometry, probabilistic models, such as the generating function method, have been successfully applied 
to compute the statistical significance of peptide-spectrum matches for short peptides containing no post- 
translational modifications. As top-down mass spectrometry, which often identifies intact proteins with post- 
translational modifications, becomes available in many laboratories, the estimation of statistical significance of top- 
down protein identification results has come into great demand. 

Results: In this paper, we study an extended generating function method for accurately computing the statistical 
significance of protein-spectrum matches with post-translational modifications. Experiments show that the 
extended generating function method achieves high accuracy in computing spectral probabilities and false 
discovery rates. 

Conclusions: The extended generating function method is a non-trivial extension of the generating function 
method for bottom-up mass spectrometry. It can be used to choose the correct protein-spectrum match from 
several candidate protein-spectrum matches for a spectrum, as well as separate correct protein-spectrum matches 
from incorrect ones identified from a large number of tandem mass spectra. 



Background 

Peptide and protein identification in mass spectrometry 
(MS) -based proteomics involves searching tandem mass 
spectrometry (MS/MS) spectra against a protein data- 
base using a search engine. In bottom-up MS, most 
search engines calculate a similarity score between a 
spectrum and a peptide and report a best-scoring pep- 
tide-spectrum match (PSM) for each spectrum [1-5]. A 
PSM is correct if the spectrum is generated from the 
matched peptide. It is vital to distinguish correct PSMs 
from those incorrect ones. 

Two main approaches have been proposed to address 
this problem. In the first approach, a large data set of 
MS/MS spectra is searched against a concatenated tar- 
get-decoy protein database to find a best-scoring PSM 
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for each spectrum, and the PSM is reported if its score 
exceeds a prespecified threshold. The false discovery 
rate (FDR) of the reported PSMs is estimated based on 
the fact that the number of decoy hits and the number 
of incorrect target hits are approximately the same [6]. 
This approach is simple and powerful when a large 
population of PSMs is reported. However, it fails to 
decide the correctness of single PSMs. In addition, it is 
unable to compute accurate FDRs when the target pro- 
tein database is small (e.g., a database with only one 
protein) or when only a small number of PSMs are 
reported [7]. 

In the second approach, the statistical significance (£- 
value or /?-value) of a PSM is computed for determining 
the correctness of the PSM. Due to the complexity of 
MS/MS spectra, many statistical models have limited 
accuracy. By contrast, Kim et al. proposed a probabilis- 
tic method for computing spectral probabilities and 
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statistical significance of PSMs [8]. This method 
achieves high accuracy, but it is not obvious how to 
extend it to PSMs with post-translational modifications 
(PTMs). 

With the rapid developments in instrumentation, top- 
down MS, which analyzes intact proteins or long pep- 
tides, has become available in many laboratories. More 
than a thousand proteins can be identified in a single 
top-down MS experiment [9] and many methods have 
been proposed for identification of proteoforms using 
top-down tandem mass spectra [10-17]. Although the 
evaluation of PSMs in bottom-up MS has been inten- 
sively studied, no systematic studies have been carried 
out for evaluating protein-spectrum matches (PrSMs) in 
top-down MS. Similar to bottom-up MS, there is now 
an increasing demand to accurately estimate the statisti- 
cal significance of single PrSMs. For instance, a top- 
down MS/MS spectrum can be matched to two different 
proteins: one contains a PTM; the other does not. Com- 
paring the £-values of the two PrSMs can determine 
which one is better. Meng et al. developed a Poisson 
model for the problem, but the model does not include 
PTMs [18]. As top-down MS/MS spectra are often 
mapped to proteoforms with PTMs, accurate estimation 
of statistical significance of PrSMs with PTMs is useful 
and challenging. We proposed a method for £-value 
computation of PrSMs by breaking a protein into several 
sub-proteins without PTMs, but it is extremely time 
consuming [17]. In this paper, we study an extended 
generating function method for accurately computing 
spectral probabilities and statistical significance of 
PrSMs in top-down MS. Our method naturally extends 
the generating function method in bottom-up MS [8]. 
Spectral probabilities reported by the extended generat- 
ing function method are further utilized for estimating 
FDRs of identified PrSMs using a method proposed in 
[7], in which decoy databases are not needed. Experi- 
ments show that the extended generating function 
method achieves high accuracy in computing spectral 
probabilities and FDRs. 

Methods 

A top-down MS/MS spectrum generated from a protein 
consists of a precursor mass, corresponding to the mole- 
cular mass of the protein, and a list of peaks, corre- 
sponding to fragment ions of the protein. Each peak 
represents the mass-to-charge ratio and the abundance 
of the fragment ion. The residue mass of a spectrum S 
is defined as M(S) = PrecursorMass - WaterMass, 
where PrecursorMass is the precursor mass of the spec- 
trum, and WaterMass is the mass of a water molecule. 
Because top-down MS/MS spectra are very complex, 
and the charge states of most fragment ions are high, 
high mass resolution and high mass accuracy spectra 



are absolutely required. The first step in top-down spec- 
tral interpretation is usually spectral deconvolution, 
which converts a complex top-down spectrum to a list 
of monoisotopic neutral masses (a deconvoluted spec- 
trum) [19,20]. The neutral masses are further converted 
to a list of prefix residue masses (PRMs) corresponding 
to the masses of protein prefixes [21]. For a collision- 
induced dissociation (CID) spectrum, the PRM spectrum 
is generated as follows: (1) the residue mass of the 
experimental spectrum is added to the PRM spectrum; 
(2) for each neutral mass x extracted from the experi- 
mental spectrum, two masses x and PrecursorMass - x 
are added to the PRM spectrum. If mass x corresponds 
to a protein suffix (prefix), then mass PrecursorMass - 
X corresponds to a protein prefix (suffix) [22]. The pro- 
posed extended generating function method can be 
applied to all types of spectra, such as CID and elec- 
tron-transfer dissociation (ETD) spectra, because all 
these types of spectra can be converted to PRM spectra. 
All masses in PRM spectra are discretized by scaling the 
masses with a constant and rounding the values to inte- 
gers [23]. For highly accurate top-down spectra, a scal- 
ing constant 274.335215 is used (e.g. mass(G) = 
57.021464 X 274.335215 = 15642.995586 « 15643) to 
reduce the rounding error to 2.5 parts per million 
(ppm) [22]. In the following analysis, we assume that 
only PRM spectra with integer masses are studied and 
peak intensities are ignored. 

Scores of PrSMs 

A PRM spectrum S is represented as an ordered list of 
integer masses, in which the largest one is M(S). Let TZ 
be the set of the 20 standard amino acids with integer 
residue masses M(r) for r &Ti (the residue masses of 
amino acids are discretized using the same discretization 
method for PRM spectra). The residue mass of r is also 
denoted as ||r||. The residue mass M{P) of a protein P is 
the sum of the residue masses of all amino acids of the 
protein. It differs from the molecular mass of the protein 
by the mass of a water molecule. A protein P with m 
amino acids is associated with an ordered list of integer 
masses pi < p2 <■ ■ ■ < Pm> where is the sum of the resi- 
due masses of the first i amino acids and p„ = M(/'). 

If the residue masses of spectrum 5 and protein P are the 
same value N , the mass count score of S and P is the num- 
ber of shared masses (except for the residue mass N) mS 
and P, denoted by CScore(S, P). The mass shift of a PTM 
is the mass difference between the modified form (with the 
PTM) and the unmodified form of an amino acid residue. 
When a PTM occurs at the jth amino acid of P and the 
mass shift d of the PTM is positive, the modified form of P 
is denoted by QtJJ^) = \pi,P2, • ■ ■,Pi-\,Pi + d, . . ., p„ + d\. 
When the mass shift of the PTM is a negative value -d, Q,_ 
-AP) = \Pi> P2> ■ ■ Pi-i> Pi - d, . . ., Pm - d}. In addition. 
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if a mass in pi - d, . . ., - d is negative or not greater 
than Pi-i, the mass is removed from Q/,_rf(P). Let 
Qd{P) = {Qi.d{P). ■■■■ Qm,d(P)} be the set of all modified 
proteins of P with a PTM of mass shift d. When the pro- 
tein is not ambiguous, we use shortened notations 
Qd = {Qi,d/ ■ • ■ / Qm,d}- To identify an experimental PRM 
spectrum 5 generated from protein P with a PTM, one can 
find the mass shift d of the PTM by comparing the residue 
masses of S and P , and compute the mass count score 
between S and each of the modified proteins in to find 
the best match. The PTM mass count score of S and P is 
defined as PScore(S,P) = maxQ^Q^ CScore(5, Q), where 
d = M(S) - M(P). 

Random proteins 

Let Pr(r) be the probability that an amino acid r e 7?. is 
observed at a position in a random protein. In practice, 
the frequencies of amino acids in the Swiss-Prot data- 
base [24] can be used to estimate Pr(r). The probability 
that a random protein P with amino acids r-^r-i . . . r,,^ is 
observed is 

nm 
. J Pr(ri)/ 

where L represents the length of the random protein. 
To simplify computation, a uniform probability Pr(L = 
m) = 1/MaxLength is chosen, where MaxLength is the 
length of the longest protein in the Swiss-Prot database. 
Despite the difference between the uniform distribution 
and the distribution of protein length in the target pro- 
tein database, experimental results showed the uniform 
distribution does not introduce large errors into the 
computation of spectral probabilities. 

Spectral probabilities 

Let I'D* be the set of negative/positive mass shifts of 
allowed PTMs. Any number in 23 = V~ U 2?* is a valid 
mass shift. Let S be an experimental PRM spectrum and 
P a random protein. The residue mass difference 
between S and P is a random variable D = M(S') - M(P 
). The spectral probability of S with respect to a thresh- 
old t and one PTM is the probability that the residue 
mass difference D e 23 and PScore(S, P) > t: 

SpecProb(S, t, 1) = Pr(D e V and PScore(S, P) > t) 

= Pr(D = d and PScore(S, P) > £) 

where 1 in SpecProb(S, t, 1) represents one PTM. 
From the definition of PScore(5, P), 

SpecProbfS, t, 1) - ^PrfD-dand (CScore(S, Qm) > tor. .. orCScoiefS, Q,,( > t)). 

Computing SpecProb(S, t, 1) accurately and effi- 
ciently is a problem that has not been solved. In the 
following subsections, we propose two upper bounds 



of SpecProb(5, t, 1). The two upper bounds can be cal- 
culated accurately and efficiently using dynamic pro- 
gramming algorithms. The second upper bound is 
better than the first one and is used for estimating 
SpecProb(5, t, 1). Since the second upper bound is lar- 
ger than SpecProb(5, t, 1), a constant K is introduced 
for correcting errors in estimated spectral probabilities. 
In practice, the value of K can be estimated from train- 
ing data sets. 

The first upper bound of spectral probabilities 

Based on Equation (2) and the union bound of probabil- 
ities, 

SpecProb(S, t,l)<Y^Yl, ^"^^^ = d and CScore(S, Q) > t). 

Let q denote the right hand part of the above inequal- 
ity. The value of q is an upper bound of SpecProb(5, t, 
1). Next, we describe a dynamic programming algorithm 
for computing the value of q. The algorithm is an exten- 
sion of the generating function method in [8]. In this 
algorithm, a spectrum S with a residue mass N is repre- 
sented as a 0/1 vector S = S1S2 . . . Sjv, where = 1 if the 
spectrum has a prefix residue mass i and 0 otherwise. 
For example, a spectrum with a PRM list {2, 5, 8, 10} 
(10 is the residue mass of the spectrum) is represented 
as 0100100100. We first study the case where all mass 
shifts are positive; negative mass shifts will be discussed 
at the end of this subsection. A three dimensional table 
T (/, /, k) is computed to acquire the upper bound, 
where / is the number of PTMs in modified proteins. 
Let ^[1 : /] be the subspectrum S1S2 . . ■ Sj of S. The resi- 
due mass of S[l : j] is /. The value T (0, j, k) is the prob- 
ability that M(P) = / and the mass count score CScore(S' 
[1 : J], P) = k. Let Vj be set of all proteins with a residue 
mass 7. We define a function: P, k) = 1 if CScore(5, 
P) = k, 0 otherwise. Then, 

no,j,k) = j2^TiP).mi:j],p,k). 

Suppose P contains m amino acids and the residue 
mass of P is J. If the last amino acid of P is r, then y - 
llrll is the prefix residue mass of the first m - 1 amino 
acids of P, where llrll is the residue mass of r. In the 
vector representation of S, if S contains a prefix resi- 
due mass 7 - llrll, Sy-iini = 1; otherwise, Sj_\\r\\ = 0. The 
recurrence function for computing T{0, 7, k) was given 
in [8]: 

no,hk) = ^T(0,i - llrll, fe - 5;_||,||) Pr(r). (5) 

reTZ 

Let Dj =7 - M{P ), the random variable representing 
the difference between 7 and the residue mass of 
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random protein P. The value T{1, j, k) is the sum of 
probabilities 



= ^ ^ = d and CScore(Sll : j|, Q) = k) 

= E E E Pr(P)./(S|l:jl,Q)=*). 



(6) 



Suppose the residue mass of protein P is j - d, that is, 
P e Pj-d. Let be the number of amino acids in P and 
Qm,d the modified protein of P whose PTM is on the 
last amino acid. Because the first m-1 masses of Qm,d 
are unchanged compared with P, 

f{S[l : i],Q„,4,k) = f{S[l : j - d],P,k). 

Combined with Equation (4), 



E E Pr(P)-/(S[l:j],Q™,rf,fe>= E^fO'J-^fe)- 



(7) 



Let r be the last amino acid of P and P the protein con- 
taining the first m-1 amino acids of P. All the m-1 
masses in Qi^{P), 1 < I < m - 1, are the same to the first 
m-1 masses in Qi^iP). While the m - 1th mass / - ||r|| in 
Qi.diP) is included in the computation of mass count 
scores, the mass ; - llrll in Qi^{P) is not included because 
it is the residue mass. Thus, 

CScore(S|l ; ;], Qw(P)) = CScore(S|l : j- HrlH.QwCP')) + 5,_||r||- 

It follows 

/(S|l :;],Qw(P),k) =/(S|l : j - l|r| ||, Qw(n k - Shwi). (8) 

Combining the fact that Pr(P) = Pr{P')Pr(r) and Equa- 
tions (6) and (8), 

m-1 

E E EPf(P)-/(sii-jl'Qw''') = Z;'^(i'J-ll'-|l'''-*Hirii)P>-W. (9) 

With Equations (6), (7) and (9), the recurrence func- 
tion for T{1, j, k) is 

nw-fe) = E^(i'j-ii'"ii''^-^/-iwi)p>-w+ E no,i-d,k). (j^q) 

When PTMs with negative mass shifts d are allowed, /' 
-dm Equation (10) is larger than /. The value T{1, j - d, 
k) has not been computed when T{1, j, k) is computed, 
making Equation (10) invalid. To avoid this problem, a 
short amino acid sequence g is introduced to guarantee 
that j - d - M(g) < j. Let be the set of all amino acid 
sequences g = r-iV-z . . .ri satisfying M(g') > -d and M(rir2 . 
. . r/_i) < -d (d is negative). Equation (10) is modified to 



ni,j,k) = j2'^(i,j-\\r\\,k-i 
+ j2no,i-d,k) 



ll)Pr('-) 



(11) 



where ll^ll is the residue mass of g, and 
Prfe) = n'=i Pr(ri) for a sequence g = rir2 . . . ri. The 
value of q is Ylk=t T{1,N, k), where N and n are the resi- 
due mass and the number of masses of S, respectively. 
The time complexity for computing 7(0, /, k) and T{1, j, 
k) is 0(N ■ t ■ z), where z is the sum of the sizes of V* 
Bind all gd,z=\V*\+j:deV-\Od\. 

The second upper bound of spectral probabilities 

The only difference between two modified proteins Qi^a 
and Q/+i,j is the ith mass. If /?, in P (which is not changed 
in Qi+i,d) does not equal any mass in S, then CScore(S', Q,- 
+i,d) < CScore(S', Q,,^). Based on this observation, if pi 
does not equal any mass in S, Qi+\,d is removed from Q^. 
In this way, a new set is acquired containing Qi^ and 
all Qi,d satisfying that equals a mass in S. It follows 
that PScore(S, P) = maxq^Qj CScore(S, Q) = maxqgQj CScore(S, Q). 
From Equation (1) and the union bound of probabilities, 

SpecProb(S, t, 1) < E E ^^^^ = ^ ^"'^ CScore(S, Q) > t). 

Let q denote the right hand part of the above inequality. 
Compared with q, the value of q is a better upper bound 
for SpecProb(S, t,l). Similar to the method for computing 
q, we fill out a three dimensional array T{i, j, k) for comput- 
ing q. The recurrence function for filling out T{0, j, k) is the 
same to Equation (5). We change the definition of T{1, j, k) 
by replacing with Q*^ in Equation (6). To compute T{1, j, 
k), the second and third terms of the right-hand part of 
Equation (11) should be changed so that only the probabil- 
ities for the modified proteins in are summed up. 

Similar to the proof of Equation (7), consider a random 
protein P € Vj-d- Let Qm,d be the modified protein of P 
whose PTM is on the last amino acid, and r the last amino 
acid of P. If Qm,d e QJ, then j - d - llrll is a mass in S or / 
- d - llrll = 0 (in the extreme case that P contains only 
one amino acid, / - <i - ||r|| = 0), and vice versa. Therefore, 
iij-d- llrll = 0 or Sj^a-WrW = 1, then Pr(P ) 'JiSll : J], 
a, k) is included in the computation of T{1, j, k). 

For a positive mass shift d, we define TZj.d as the set of 
amino acids r G R satisfying that - <^ - llrll = 0 or the 
element Sj_d-i\rii = 1- For a negative mass shift d, we 
introduce a set of amino acid sequences g = r\r2 ■ ■ . 
ri satisfying: (1) M(g) > -d, (2) M(rir2 . . . r/_i) < -d, 
and (3) y - (i - ll^ll = 0 or the element Sj_a-\\g\\ = 1- Then 
Equation (11) is changed to: 



r(l,i,fe) = ^r(l,i- ||r||,fe-Sj_||r||)Pr(r) 

+ E E no,i-d- llrll, fe)Pr{r) ^^^^ 



+ E J2'^i°'>-'^-^^S\l,k)Pr{g), 



+ E E mj-d-\\g\\,k)Pi{g), 
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and t^' = Ylk^t t). The time complexity for com- 

puting T(0, j, k) and T{\, j, k) is similar to the method 
in the previous subsection. 

Since the scores CScore(5', Q) for Q e are not 
independent, q' is usually larger than the spectral prob- 
ability SpecProb(5, t, 1). To estimate SpecProbCS, t, 1) 
more accurately, q is multiplied by a constant value K 
for correction: 

SpecProb{s,£, 1) ^ Kq' . (13) 
P-values and £-values 

Let J\f = {N + d : d € V], where N is the residue mass of 
S. From table T{Q, j, k) described in the previous subsec- 
tion, we can compute the probability that the residue 
mass difference D between S and P is in D: 

n 

Pr{DeI?) = ^^r(0,j,fe). (14) 

Using Equations (13) and (14), the conditional spectral 
probability of S with respect to threshold t and one PTM is 

CSP(S, t, 1) = Pr(PScore(S, P) > t|D eV)^ p,^D€Vy ^^^^ 

Intact proteins may have N or C-terminal truncations, 
e.g., the removal of a signal peptide. If a top-down MS/ 
MS spectrum is matched to an intact protein without 
N- or C-terminal truncations, the PrSM is called a com- 
plete PrSM. A PrSM matched to an intact protein with 
an N-/C-terminal truncation is called a suffix/prefix 
PrSM. An internal PrSM corresponds to an intact pro- 
tein with both N- and C-terminal truncations. 

Similar to the £-values defined in BLAST [25], the E- 
value of a PrSM describes the number of hits one can 
"expect" to see by chance when searching the spectrum 
against a protein database of a particular size. Suppose a 
complete PrSM contains one mass shift (PTM) in J) 
and its PTM mass count score is t. We count the num- 
ber Z of proteins in the target database with a residue 
mass in J\f. The £-value of the complete PrSM is esti- 
mated as Z • CSP(S, t, 1). The p-value of the PrSM is 
estimated as 1 - (1 - CSP(S, t, 

For prefix, suffix and internal PrSMs, we count the 
numbers Zp, Zj, and Z, of prefixes/ suffixes/internal sub- 
proteins in the target database with a residue mass in 
J\f. Because some prefixes/suffixes/internal sub-proteins 
are not independent, a constant factor CplCJCi is multi- 
plied in the computation of £- values of prefix/suffix/ 
internal PrSMs for correction. For example, if a prefix 
PrSM contains one mass shift (PTM) in V and its PTM 
mass count score is t, the £-value of the PrSM is esti- 
mated as Cp- Zp - CSP(S, t, 1). 



Multiple PTMs 

The dynamic programming algorithm for computing the 
second upper bound can be extended to estimate £- 
values of PrSMs with multiple PTMs. When multiple 
PTMs are allowed, we replace r(0, ;', K) and 7(1, /, k) in 
Equation (12) by T{i, j, k) and T{i - 1, /, k) to estimate 
spectral probabilities with respect to i PTMs: 

+ E E - -d-\\r\\. k) Pr(r) 

+ E J2'^^'-'^'i-'^-\\s\lk)PT(g), 

Results 

The extended generating function method, TD-GF 
(Top-Down Generating Function), was implemented in 
JAVA and tested on a desktop with a 3.3GHz (AMD 
Opteron 6204) CPU and 16 GB memory. 

Data sets 

A Salmonella typhimurium (ST) data set [13] was used 
to test TD-GF. A protein mixture of ST was analyzed 
using an LTQ-Orbitrap (Thermo Fisher Scientific). MS 
and MS/MS spectra were collected at a resolution of 
60,000 and 30,000, respectively. The experiment was 
repeated using gas-phase fractionation. A total of 14,041 
collision-induced dissociation (CID) MS/MS spectra 
were acquired. The detailed experiment procedure can 
be found in [13]. 

The performance of TD-GF on proteoform identifica- 
tion was tested on an Escherichia coli (EC) data set. An 
EC cell lysate was separated by an intact protein 
reversed phase liquid-chromatography (RPLC) system 
and analyzed by an LTQ-Orbitrap Velos (Thermo Fisher 
Scientific). MS and MS/MS spectra was collected at a 
resolution of 60,000. A total of 3,704 higher-energy C- 
trap dissociation (HCD) MS/MS spectra were obtained. 

Spectral probabilities for PrSMs with one PTM 

The accuracy of TD-GF was evaluated using two 
approaches based on conditional spectral probabilities 
(defined in Equation (15)) and FDRs. 
Evaluation based on conditional spectral probabilities 
To evaluate TD-GF, we generated a set of PrSMs with 
"correct" conditional spectral probabilities and com- 
pared the "correct" conditional spectral probabilities 
with those reported by TD-GF. 

Selection of PrSMs Previous analysis results in [17] 
showed that most PrSMs identified in the ST data set 
contained no PTMs. To increase the number of identi- 
fied PrSMs with one PTM, a mutated ST protein 
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database was generated by adding a glycine residue to 
the middle of each protein sequence in the ST pro- 
teome. When the mutated ST protein database is used, 
a PrSM without PTMs can be identified as a PrSM with 
one PTM. 

All MS/MS spectra in the ST data set were deconvo- 
luted using MS-Deconv [20] and searched against the 
mutated ST proteome using MS-Align+ [17]. The error 
tolerances for precursor masses and fragment masses were 
set to 15 ppm, and carbamidomethylation was set as the 
fixed PTM. By restricting the search space to only com- 
plete PrSMs with one PTM, MS-Align+ identified 4,291 
PrSMs. For each of 4,291 PrSMs, TD-GF was employed to 
compute the conditional spectral probability, which was 
used only for selecting PrSMs, not for evaluating TD-GF. 
The parameter K in Equation (13) was set to 1. Since blind 
PTM search was used in MS-Align+, the allowed mass 
shifts were set to T)~ = {—a, —a + I, . . . , —1} and 
T>* = {1,2, ... ,a], where a is the mass of a tryptophan 
(W) residue. The running time for computing conditional 
spectral probabilities was 726 minutes (about 12 hours). 
For 203 of the 4,291 complete PrSMs, the conditional 
spectral probabilities reported by TD-GF were in [10 ^ 10 
"^] . The 203 PrSMs were selected for computing "correct" 
conditional spectral probabilities. 

Computation of "correct" conditional spectral probabil- 
ities For each of the 203 PrSMs (spectra), a random 
database of lO'' proteins was generated. In the random 
database, the residue masses of all proteins are in 
{N + d : d e V}, where N is the residue mass of the spec- 
trum. The PTM mass count score between the spectrum 
and each protein in the database was computed; and the 
number x of proteins satisfying that the PTM mass 
count score > t was counted. The conditional spectral 
probability of the PrSM with respect to one PTM and 
threshold t was estimated as x/lQ^. Since the above 
method follows the definition of conditional spectral 
probabilities, the results are treated as "correct" condi- 
tional spectral probabilities. Finally, one PrSM was 
removed from the list of 203 PrSMs because the esti- 
mated conditional spectral probability (using a random 
database) was 0. 

Evaluation of TD-GF The 202 PrSMs were randomly 
divided into a training data set (101 PrSMs) and a test 
data set (101 PrSMs). The training data set was used to 
estimated the value of K in Equation (13). We set K = 1 
(the value of K will be determined later) and used TD- 
GF to compute the conditional spectral probabilities for 
the training PrSMs. Let pi and P2 be the conditional 
spectral probabilities of a PrSM estimated by the ran- 
dom database-based method and TD-GF, respectively. 
The error of p2 is defined as e = | \og(pi) - log(/?2)| 
(base 10). To minimize the average error of the condi- 
tional spectral probabilities reported by TD-GF, the best 



value of log(/<0 is the average of the log ratio 
log(^) = log(pi) - log(p2). Using the training data set, 
K was set to the best value 0.55. In practice, the default 
values of K are learned from various types of training 
data, such as CID and ETD data, and are provided so 
that the users do not need to estimate K for their own 
data sets. With K = 0.55, TD-GF was employed to com- 
pute the conditional spectral probabilities for the test 
PrSMs. The errors of these conditional spectral prob- 
abilities were obtained by comparing them with the 
"correct" ones (Figure 1). The errors for 98 test PrSMs 
(97%) were < 0.5. When the error is 0.5, there is about 
a three fold difference between the conditional spectral 
probabilities reported by the two methods. The results 
show that the spectral probabilities estimated by TD-GF 
are accurate for most of the test PrSMs. 
Evaluation based on FDRs 

With the spectral probabilities reported by TD-GF, the 
"estimated" FDR of a set of identified PrSMs for a cut- 
off p-value can be computed using the functions in [7]. 
For the same cut-off p-value, the "correct" FDR can be 
obtained by the target-decoy approach. Because the 
"estimated" FDR is based on the spectral probabilities 
reported by TD-GF, if the "estimated" FDR is similar to 
the "correct" FDR, then the spectral probabilities 
reported by TD-GF are accurate. 

Using all the 4,291 complete PrSMs with one PTM, 
we computed "estimated" FDRs for cut-off jj-values in 
{0.0001, 0.0002, . . ., 1.0000} based on spectral probabil- 
ities. In the target-decoy approach, all spectra were 
searched against a concatenated target and shuffled 
decoy protein database. Because the FDR reported by 
the target-decoy approach was 0 when the cut-off p- 
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Figure 1 A comparison of thie conditional spectral probabilities 
(for PrSIVls with one PTIVI) estimated by the random database- 
based method and TD-GF. For each of the 101 test PrSMs, the 
error of the conditional spectral probability reported by TD-GF is 
computed. For each cut-off of e, the number of PrSMs with an error 
< e is counted. 
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Figure 2 A comparison of tfie FDRs of PrSIVls witfi one PTM 
estimated by the target-decoy approach and computed based 
on spectral probabilities. For a given cut-off p-value, the two 
reported FDRs are compared, and - log(FDR) (base 10) is plotted 
against - log(cut-off p-value) (base 10). 



value was smaller than 8.27 x 10^'*, we compared only 
the FDRs for cut-off /7-values greater than 8.27 x 10" 
(Figure 2). The FDRs estimated based spectral probabil- 
ities were consistent with those reported by the target- 
decoy approach. For example, the target-decoy approach 
and the spectral probability approach reported cut-off p- 
values 0.0327 and 0.0262 for 1% FDR, respectively. The 
difference between the two /i-values is only 0.0065, 
which is evidence that the spectral probabilities reported 
by TD-GF are accurate. 

Prefix, suffix and internal PrSIVls 

In this subsection, we describe the methods for estimat- 
ing parameters Cp, Cj and C, introduced in Section 
Methods. A substring ... fly of a string S = a-^a-i . 

. . a„ is denoted by S\i : j]. To estimate the parameter 
Cp for prefix PrSMs, a new random protein database 
was generated for each of the selected 202 PrSMs: (1) a 
total of 1,000 long random protein sequences with 1,200 
amino acids each were generated, and (2) prefixes 5[1 : 
201], . . ., ^[1 : 1200] were extracted from each of the 
1,000 long protein sequences. In total, 10^ prefixes were 
added to the random protein database. The conditional 
spectral probabilities estimated using the new random 
databases are different from those using the random 
databases in Subsection "Computation of correct condi- 
tional spectral probabilities" because the protein 
sequences in the new random databases are not inde- 
pendent. Parameter Cp was estimated as the average 
ratio 0.693 between the probabilities computed based on 
the new databases and the random databases in Subsec- 
tion "Computation of correct conditional spectral prob- 
abilities" for the 202 PrSMs. Parameter Cj can be set to 
the same to Cp. 



To estimated the parameter C, for internal PrSMs, a 
third type of random protein databases were used: (1) a 
total of 4 long random protein sequences with 1200 
amino acids each were generated, and (2) 2.5 x 10^ sub- 
strings S[i : j] (1 < / < 500, i+201 < J < i + 700) of the 
each long protein sequence were added to the random 
database. Using the same method for computing Cp, 
parameter C, was estimated as 0.508. 

Spectral probabilities for PrSIVls with two PTMs 

Similar to PrSMs with one PTM, a mutated protein 
database was created to increase the number of identi- 
fied PrSMs with two PTMs. Two glycine residues were 
added each protein in the ST protein database: one is at 
the one-third position of the protein; the other at the 
two-thirds position. MS-Align+ identified 2,404 com- 
plete PrSMs with two PTMs, and TD-GF was used to 
compute the spectral probabilities for the 2,404 PrSMs. 
The running time for computing spectral probabilities 
was 1,317 minutes (about 22 hours). Because it is 
extreme slow to find the best PrSM with two PTMs by 
searching a spectrum against a large random protein 
database with 10 proteins, the evaluation method based 
on conditional spectral probabilities was not used. Only 
the evaluation method based on FDRs was applied. 
With all the 2,404 identified PrSMs, FDRs based on 
spectral probabilities and based on the target-decoy 
approach were computed for cut-off /^-values in {0.0001, 
0.0002, . . ., 1.0000). When the cut-off j?-value is smaller 
than 0.016 (- log p-value >1.80), the FDRs estimated by 
the two methods are similar. For 1% FDR, the target- 
decoy approach and the spectral probability approach 
estimated similar cut-off /?-values 0.0164 and 0.0116, 
respectively. However, the FDRs based on spectral prob- 
abilities are not consistent with the "correct" FDRs 
(reported by the target-decoy approach) when the cut- 
off p-value is larger than 0.016 (Figure 3). One possible 
reason is that the filtering method implemented in MS- 
Align+ fails to keep the best PrSMs when their /(-values 
are not small enough. From the above analysis, the spec- 
tral probabilities estimated by TD-GF are accurate when 
they are smaller than 0.016. 

Comparison with ProSightPC 

All MS/MS spectra in the EC data set were deconvo- 
luted by MS-Deconv [20]. The EC proteome database 
was downloaded from the Swiss-Prot database; a com- 
bined protein database was generated by concatenating 
the EC proteome database and a shuffled decoy data- 
base. To test the performance of TD-GF on proteoform 
identification, MS-Align-i- coupled with TD-GF was 
applied to search the deconvoluted spectra against the 
combined database. The error tolerances for precursor 
masses and fragment masses were set as 15 ppm and 
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Figure 3 A comparison of the FDRs of PrSMs with two PTiVls 
estimated by the target-decoy approach and computed based 
on spectral probabilities. For a given cut-off p-value, the two 
reported FDRs are compared, and - log(FDR) (base 10) is plotted 
against - log(cut-off p-value) (base 10). 



two unknown PTMs were allowed. Using 1% spectrum- 
level FDR, 1,478 spectra were identified. 

ProSightPC [10] was also applied to analyze the EC data 
set. ProSightPC provides several search modes for top- 
down spectral identification, including the absolute mass 
mode and the biomarker mode. Since some spectra in the 
EC data set were generated from truncated proteins, the 
biomarker mode was chosen for the analysis of the EC 
data set. The error tolerances for precursor masses and 
fragment masses were set as 15 ppm. ProSightPC identi- 
fied 627 spectra with 1% spectrum-level FDR. All the 627 
spectra were identified by MS-Align-F coupled with TD- 
GF. The test results show that MS-Align-i- coupled with 
TD-GF outperformed the biomarker mode of ProSightPC. 

Conclusions 

The experiments showed that the extended generating 
function method achieves high accuracy in computing 
spectral probabilities of PrSMs with PTMs. It is a non- 
trivial extension of the generating function method pro- 
posed in [8]. With accurate spectral probabilities and E- 
values, one can easily choose the correct PrSM from 
several candidate PrSMs for a spectrum, as well as sepa- 
rate correct PrSMs from incorrect ones identified from 
a large number of spectra. In addition, it provides a way 
to evaluate single PrSMs efficiently. 
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